VARIABLE SCOPE PATENT SEARCHING BY AN INVERTED FILE TECHNIQUE


INTRODUCTION 


The United States Patent Office is conducting 
an experiment in mechanized searching of patent 
literature which employs a technique similar to 
one previously found not suitable for its litera- 
ture search requirements. This technique is that 
of the coordinated or “inverted” file system for 
the coded information abstracted from the patent. 

Previously,¹ this technique was found wholly
inadequate for Patent Office operations and was 
abandoned in favor of the sequential or “normal” 
file arrangement. The inverted file at that time 
appeared to be unsatisfactory because of its ap- 
parent inability to retrieve information with pre- 
cisely interrelated concepts as is required and 
yet allow a searcher to request a selection on the 
basis of either the generic or the specific scope 
of each term or concept being sought. Since that 
time, search systems have been devised by the 
Patent Office using the “normal” file arrange- 
ment which yield the required degree of precision 
for depicting interrelationship and yet allow the 
searching to be done at selected degrees of breadth 
or specificity. 

The recent introduction of a small scale elec- 
tronic computer having large random access mem- 
ory has made possible the development of pro- 
cedures for incorporating many of the precision 
and variable scope features of the “normal” file 
systems into an “inverted” file system. 

The mechanized search system now being de- 
veloped involves the following features: 

(1) Parallel access searching in which only those 
portions of the file having subject matter perti- 
nent to each set of search terms are isolated for 
mechanical processing as contrasted to serial 
searching in which a sequential processing of all 
portions of the file is required. 

(2) Correlations are made amongst concepts or 
terms which individually are not restricted to the 
precise meaning of the terms as they appear in 
the dictionary but may be altered by instructing 
the stored program of the computer to generate 
within itself those files having the desired concep- 
tual meaning. 

(3) Dictionary terms are generated from the 
language of those documents comprising the file 
without using a prearranged hierarchical system of 
terms. 

The group of patents selected for this experiment
comprises the chemical polymer art, which
involves organic and inorganic compounds as well
as properties, functions and processes associated
therewith. This is an extension of the types of
information handled in the Variable Scope Search
System (VS3),² but the principles of recording precise
interrelationships of subject matter and the
recognition of genus-species relationships have
been adhered to in a substantial way.


SHOWING RELATIONSHIPS IN SERIAL 
AND INVERTED FILES 


It is believed that the present experiment may 
be best described by concrete illustrations of the 
manner in which relationships, both of the inter- 
relational and the genus-species types are handled 
in the Patent Office. 

Let us assume, for simplicity in description, that
our dictionary of descriptors consists of but ten
terms. Furthermore, let these descriptors be limited
to those applicable to certain chemical ring structures
only, for the same reason. Such a dictionary
might appear as follows:


Code   Name
 1     phenyl
 2     pyrryl
 3     furyl
 4     pyridyl
 5     oxazolyl
 6     oxazinyl
 7     pyranyl
 8     a six-membered ring
 9     a nitrogen-containing ring
10     an oxygen-containing ring

[The Structure column, which drew the ring for each
of codes 1 to 7, is not reproduced.]


The descriptors identified by 8, 9 and 10 are 
more generic in character than those identified 
by the numerals 1 to 7 since they may be properly 
applicable to one or more of the other descriptors. 

The descriptors 1 to 7 on the other hand, are 
actually “building blocks” or “fragments” of which 
one or more may be associated together to identify 
a chemical compound. 




The disclosure of a chemical patent or other 
document is usually more than a mere listing of 
chemical compounds, for each of the chemical 
compounds is associated with certain of the other 
compounds to form a definite process or chemical 
reaction chain in which each compound may have 
a role such as starting material, final product,
solvent, catalyst and the like. A patent or document
may also depict a number of different processes,
each having its own set of mutually related
notions or descriptors.


Neither a serial nor an inverted file would 
be satisfactory in the Patent Office if only a sin- 
gle level of association of the applicable de- 
scriptors were to be had. This results from the 
fact that a single level of descriptor association, 
i.e., the level of the entire document, would allow 
retrieval of a host of documents which are non- 
pertinent to a normal search request because 
there is no ability to associate those descriptors 
which together identify a particular “fragment” 
of one compound as distinguished from those de- 
scriptors in any other “fragment,” or in any 
other compound or in any other process of that 
document. 
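
The non-pertinent retrieval described above can be made concrete with a minimal sketch. The document contents and the two helper functions are hypothetical and serve only as illustration; the descriptor codes are those of the ten-term dictionary.

```python
# Hypothetical document with two compounds, each a set of descriptor codes.
# Compound 1 carries phenyl (1) and furyl (3); compound 2 carries pyrryl (2).
document = {
    "compound 1": {1, 3, 8, 10},
    "compound 2": {2, 9},
}

# Single level of association: every descriptor is pooled at document level.
pooled = set().union(*document.values())


def responds_single_level(request):
    """True if every requested descriptor occurs anywhere in the document."""
    return request <= pooled


def responds_by_compound(request):
    """True only if one compound carries every requested descriptor."""
    return any(request <= codes for codes in document.values())


request = {1, 9}  # phenyl together with a nitrogen-containing ring
print(responds_single_level(request))  # True: a non-pertinent retrieval
print(responds_by_compound(request))   # False: no single compound has both
```

The pooled test answers yes because phenyl and the nitrogen ring both occur somewhere in the document, although never in the same compound.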

This can be illustrated by recourse to a 
series of hypothetical processes involving 
hypothetical compounds selected from our dic- 
tionary. 


Suppose hypothetical patent A were to disclose
a first reaction process having two compounds as
follows:

[Structural drawings of the two compounds of the
first process are not reproduced.]

as well as a second, different reaction process
having two other compounds as follows:

[Structural drawings of the two compounds of the
second process are not reproduced.]

A serial file would consist of a heading identifying
the patent number, followed or preceded
by all ten descriptors of our dictionary, as follows:

A
1, 2, 3, 4, 5, 6, 7, 8, 9, 10

An inverted file having only a single level of
association would consist of a series of ten headings,
one for each descriptor of our dictionary,
with each heading followed by the number of the
patent as follows:

1    2    3    4    5    6    7    8    9    10
A    A    A    A    A    A    A    A    A    A

It is obvious that this patent would respond to
every search request possible with our limited
dictionary, including such dissimilar requests as:

[Structural drawings of two dissimilar example
requests are not reproduced.]

The serial file approach was the first to adapt
itself readily to a multi-level association of
descriptors, so that all the precise relationships
of the building blocks of a disclosure, and no more,
were recorded. A technique of the mathematician,
i.e., the bracket, was employed to establish the
proper relationships. Patent number A would
thus be recorded as follows:

[The bracketed record of patent A is only partly
legible. Its final portion, covering the second
process, reads { [(1, 8) (3, 10)] [(7, 8, 10)] }.]

in which

      outermost brackets = limits of a patent
      { } = limits of a process
      [ ] = limits of a compound
      ( ) = limits of a fragment of a compound


Since the serial file is scanned one symbol at a
time in sequence, it is relatively simple to cause
a search machine³ to recognize the codes for the
descriptors as being grouped within or without any
limit specified in a question. Thus a compound
[(3) (8)] could be recognized as present in patent
A but not compound [(1) (9)]. The serial search
system has proved itself to be satisfactory and
is now operational in the Patent Office for a small
portion of the polymer file. The chief difficulty
is the apparent inefficiency of scanning all the
information contained in any file for each search
request, which is time consuming for present day
search machines, but future machine developments 
may possibly remove this handicap without adding 
excessive cost factors. 
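
The way a serial scan respects bracket limits can be sketched as follows. This is a minimal illustration and not the logic of the search machine itself: nested Python lists stand in for the brackets, the record is a hypothetical one holding a single process, and the function name is invented for the example.

```python
# Nested lists stand in for the brackets: patent > process > compound > fragment.
# Hypothetical record with one process of two compounds.
record = [                      # limits of a patent
    [                           # limits of a process
        [{1, 8}, {3, 10}],      # compound with a phenyl and a furyl fragment
        [{7, 8, 10}],           # compound with a pyranyl fragment
    ],
]


def compound_present(record, wanted_fragments):
    """Scan every compound in sequence; succeed when one compound
    contains, for each wanted descriptor set, a fragment carrying it."""
    for process in record:
        for compound in process:
            if all(any(wanted <= fragment for fragment in compound)
                   for wanted in wanted_fragments):
                return True
    return False


print(compound_present(record, [{3}, {8}]))  # True
print(compound_present(record, [{1}, {9}]))  # False
```

Every compound of every record is visited for each question, which is the inefficiency of the serial arrangement noted above.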

The availability of a large random access mem- 
ory computer appears to remove many of the con- 
straints which it was felt were present in the well 
known Batten card or coordinate index forms of 
implementing an inverted file. These constraints 
fundamentally resulted from the limited number of 
entries possible on a single card or sheet, the
difficulty in posting new information and the manual
manipulation of the cards or sheets.

In the inverted file system now being prepared,
interrelationships between building blocks or structural
fragments are confined to the proper limits
by adding to the end of the document number one
or more arbitrarily assigned digits which reflect
an association in a common process or compound.⁴,⁵
Genus-species relationships may be
sought as extensively as desired by machine
generating a generic file from a series of specific
files through programmed manipulations executed
on the specific files.

Thus the inverted file for patent A of this simple 
example would be arranged as follows: 


  1      2      3      4      5      6      7      8    9    10
A-2-1  A-1-1  A-2-1  A-1-2  A-1-2  A-1-1  A-1-1    1    2    3
       A-1-2                              A-2-2    4    4    5
                                                   6    5    6
                                                   7    6    7


Actually the relationships expressed under 8, 9 
and 10 are wholly independent of the disclosure of 
any patent but may be regarded as a part of the 
chemical “grammar” common to all chemical docu- 
ments. These may be considered as “high level” 
terms. As many different levels of meanings as 
may be convenient can be employed. Actually in 
the polymer file three levels were used. 

With an inverted file system such as this a 
search for any compound would involve selecting 
only those portions of the total file which relate 
to the fragments involved in the sought for com- 
pound and correlating the complete document num- 
bers thereunder to thereby identify those docu- 
ment numbers common to all the fragments. If 
the search is for two or more compounds associ- 
ated in the same process, each compound is sep- 
arately processed and the resulting lists of docu- 
ment numbers again compared, but this time ig- 
noring the portion of the document numbers de- 
noting the compound number. Similarly, if two 
different processes are sought in the same patent, 
correlations are made for each process as above
and then followed by recorrelation of the docu- 
ment numbers ignoring both the compound and 
process numbers. 
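
The successive correlations described above can be sketched as set intersections taken at decreasing precision. The postings below are hypothetical, loosely modeled on patent A, and the function names are invented for the example; a posting such as A-2-1 is held as the triple ("A", 2, 1), read as document, process, compound.

```python
# Hypothetical inverted file: fragment code -> postings (document, process, compound).
inverted = {
    1: {("A", 2, 1)},
    3: {("A", 2, 1)},
    4: {("A", 1, 2)},
    5: {("A", 1, 2)},
    7: {("A", 2, 2)},
}


def compounds_with(fragments):
    """Correlate complete numbers: postings common to every fragment file."""
    return set.intersection(*[inverted[f] for f in fragments])


def same_process(*compound_hits):
    """Recorrelate while ignoring the compound number."""
    return set.intersection(*[{(d, p) for d, p, _ in hits} for hits in compound_hits])


def same_document(*process_hits):
    """Recorrelate while ignoring both compound and process numbers."""
    return set.intersection(*[{d for d, _ in hits} for hits in process_hits])


x = compounds_with([1, 3])   # {("A", 2, 1)}
y = compounds_with([7])      # {("A", 2, 2)}
z = compounds_with([4, 5])   # {("A", 1, 2)}
print(same_process(x, y))    # {('A', 2)}: both compounds in process 2
print(same_process(x, z))    # set(): the compounds are in different processes
print(same_document(same_process(x, y), same_process(z)))  # {'A'}
```

Only the files for the fragments named in the question are touched, in contrast to the serial scan.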

If one or more of the fragments of a compound 
are identified by a generic descriptor the search 
proceeds as before except each such generic de- 
scriptor produces lists of other descriptors which 
in turn produce a consolidated list of all compounds
in the system meeting the description of
the generic term. The searcher does not need to 
know which species are members of the selected 
genus. 
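
A minimal sketch of this generic expansion follows. The generic lists for terms 8, 9 and 10 follow from the ten-term dictionary; the postings in the specific files are hypothetical, and the function name is invented for the example.

```python
# Generic file: high level term -> the specific fragment codes it covers.
generic = {
    8: [1, 4, 6, 7],    # six-membered rings
    9: [2, 4, 5, 6],    # nitrogen-containing rings
    10: [3, 5, 6, 7],   # oxygen-containing rings
}

# Hypothetical specific files: fragment code -> compound postings.
specific = {
    1: {"A-2-1"},
    2: {"A-1-1"},
    4: {"A-1-2"},
    6: {"A-1-1"},
    7: {"A-2-2"},
}


def postings(term):
    """Return compound postings for a term, expanding a generic term
    into its species and consolidating their files."""
    if term in generic:
        consolidated = set()
        for species in generic[term]:
            consolidated |= specific.get(species, set())
        return consolidated
    return specific.get(term, set())


print(sorted(postings(9)))            # ['A-1-1', 'A-1-2']
print(sorted(postings(9) & postings(1)))  # []: no compound has both
```

The caller asks for term 9 without naming pyrryl, pyridyl, oxazolyl or oxazinyl; the expansion supplies them.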

The inverted file system now undergoing development
utilizes three levels of descriptors, in
which the third or lowest level terms are descriptors
of specific compounds identified in the documents,
while the second or intermediate level contains
descriptors of specific structural fragments of
these compounds, and the first or high level terms
describe various mutual attributes of those fragments.
Since the third level terms are on a specific
compound basis, it is not necessary to add to the
document numbers more than the small arbitrary
number indicative of the process in which the compounds
are found. Furthermore, the patent numbers
have been replaced by four digit accession numbers.
This results in a file of five digit numbers, four
digits for the accession number and one for the process
number.

Thus the search routine is merely one of making 
successive correlations of lists of five digit num- 
bers. 

A computer program has been developed to rec- 
ognize and make these correlations at the time of 
the search. This is done, in essence, by two 
routines called (1) merge and (2) match. 
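
The two routines, which the following paragraphs describe, can be sketched as a single pass over two lists. This is an illustration only and assumes each file is held in ascending order; nothing here describes how the actual computer program is written.

```python
def merge(a, b):
    """'Or' relationship: union of two ascending lists, without duplicates."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out + a[i:] + b[j:]


def match(a, b):
    """'And' relationship: entries common to two ascending lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


file_8 = [1, 4, 6, 7]   # fragments that are six-membered rings
file_9 = [2, 4, 5, 6]   # fragments that are nitrogen-containing rings
print(merge(file_8, file_9))  # [1, 2, 4, 5, 6, 7]
print(match(file_8, file_9))  # [4, 6]
```

Both routines read each list once from front to back, so their cost grows with the length of the two files rather than with the size of the whole collection.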


(1) Merge

The merge represents an “or” relationship. That
is, things which are either 6 membered rings or
nitrogen rings are discovered by merging the
listings under 8 and 9 into one listing as follows:


  8     9         Merge
  1     2           1
  4     4           2
  6     5           4
  7     6           5
                    6
                    7


(2) Match

The match provides an “and” relationship. Rings
which are both 6 membered and nitrogen containing
are obtained by matching identities of fragments
listed in 8 and 9.


  8     9         Match
  1     2           4
  4     4           6
  6     5
  7     6


Combinations of merge and match


These routines are ordinarily employed in combination.
For example:


(a) To find compounds having a 6 membered
nitrogen ring:


  8     9      Match       4        6        Merge
  1     2        4       A-1-2    A-1-1      A-1-1
  4     4        6                           A-1-2
  6     5
  7     6


An illustration will now be given of the search
operation using the hypothetical disclosures and
dictionary set forth in Appendix A.

For convenience, one-digit codes are used to 
represent first level terms (generic), two-digit 
codes to represent second level terms (fragments) 
and three-digit codes to represent third level terms 
(compounds). Under each first level term is filed 
a series of two digit numbers identifying those 
second level terms correctly included as fragments 
within the genus of the first level term. Each 
second level term in turn heads a file of three 
digit numbers identifying all those third level terms 
corresponding to compounds containing that frag- 
ment. Each third level term finally heads a file of 
five digit numbers in which the first four digits rep- 
resent the number assigned to a document and the 
fifth digit is an arbitrary identification of the par- 
ticular chemical process in which the compound 
term on the third level is associated. Because of 
machine techniques the numbers representing these 
terms on all three levels are called “addresses” 
since they are used to locate the files in the mem- 
ory of the computer. 


For this example assume the search question as
follows:

Find all documents in which compounds A and B
are in the same process, and in which compounds
C and D are likewise in a common process, where
A, B, C and D are each specified as follows:


A: compound including both a 6-membered oxygen
   ring and a fragment given by a structure
   drawing [not reproduced]
B: compound including both a nitrogen containing
   ring and a halogen
C: compound comprises C=C-C=C-Cl
D: compound includes both a 5 membered ring
   and a halogen


Symbolically this question would be represented as:


        A            B              C          D

{ [(1, 3) (10)]  [(4) (8)] }   { [(108)]  [(2) (8)] }


The machine to be employed is the RAMAC 305 
for which a program is being developed which will 
accept a series of punched cards bearing the ad- 
dresses of those portions of the file which are to 
be investigated as well as information showing the 
logical grouping required by the question. 

The computer will then seek out the sets of data 
in its file corresponding to these addresses and 
perform a succession of merging, matching and 
reseeking operations until it arrives at the numbers 
of the documents satisfying the search requirement. 

The specific steps performed by the computer 
are diagramed in Appendix B for this particular 
search question. 
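
The closing steps of this search can be traced in a short sketch. The five digit files for compounds 104, 105, 108 and 112 are copied from Appendix A. Three things are readings of the Appendix B diagram and should be treated as assumptions: that 104, 105 and 112 are the compounds answering D, that 108 is compound C, and that 1001-0 is the outcome of the A and B branch. The function names are invented for the example and do not describe the RAMAC program.

```python
# Third level files (compound address -> five digit numbers), from Appendix A.
third_level = {
    104: ["1000-2", "1001-1", "1005-1"],
    105: ["1002-1", "1004-0", "1005-1"],
    108: ["1001-1", "1004-0"],
    112: ["1000-2", "1001-1"],
}


def merge(*files):
    """'Or': consolidate several files into one ascending list."""
    return sorted(set().union(*files))


def match(a, b, digits=5):
    """'And': numbers common to both lists, compared on all five digits
    or, with digits=4, on the accession number alone."""
    def key(number):
        return number[:4] if digits == 4 else number
    keys_b = {key(n) for n in b}
    return sorted({key(n) for n in a if key(n) in keys_b})


# Assumed reading of Appendix B: compounds 104, 105 and 112 answer D.
d_hits = merge(third_level[104], third_level[105], third_level[112])
# C (compound 108) and D must share a process: match on five digits.
cd = match(third_level[108], d_hits)
# Assumed outcome of the A and B branch, as legible in Appendix B.
ab = ["1001-0"]
# The two processes need only share a document: match on four digits.
answer = match(ab, cd, digits=4)

print(d_hits)   # ['1000-2', '1001-1', '1002-1', '1004-0', '1005-1']
print(cd)       # ['1001-1', '1004-0']
print(answer)   # ['1001']
```

The four digit match drops the process digit, which is why 1001-0 and 1001-1 correlate to the single document 1001.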

While not reported here, the actual search system 
being constructed for the polymer patents recog- 
nizes the role or function each compound plays in 
the total disclosure and is subject to retrieval on 
that basis as well as that of the compound identi- 
fication. 


CONCLUSIONS 


The system described appears to offer a promis- 
ing approach to the machine searching problem. 
Many problems remain to be solved, however. For
example, where it is required to find a process 
containing A+B and another process containing 
C+D, it is not yet possible to avoid retrieval of 
the invalid answer A+B+C+D, all in the same 
process. Similarly, a fragment answering two 
separate sets of descriptors will respond as an 
answer to both. 

Also, while only three search levels have been
described, it is believed that more levels of search
can be provided in order to encompass a more
extensive or elaborate hierarchy.

In addition, the system should be applicable, in 
principle, to subject matter outside the chemical 


Compositions of Matter.” A paper presented 
before the 113th meeting of the American Chem- 
ical Society, Chicago, April 1948. 
APPENDIX A


1st Level (Generic) Terms


Address   Descriptors              Fragments
1         6 membered ring          10, 11, 12, 17
2         5 membered ring          13, 14, 15, 16
3         O containing ring        13, 14, 16, 17
4         N containing ring        12, 15, 16, 17
5         Alkyl                    18, 19, 20, 21
6         Ethylenic unsaturated    22, 23, 24
7         Conjugated               23, 24
8         Halogen                  25, 26, 27

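[Second level (fragment) table. The Address and
Descriptors columns, which gave each fragment address
with its structure drawing, are not reproduced. The
Compounds column follows in its original order, with
two entries partly illegible.]

Compounds

100, 101, 109, 111
102
103
104, 112
105, 106
106, [illegible]
107
109
100, 101
101
106
107, 109
102, 110
103, [illegible]
103, 105, 108
104, 112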


3rd Level Terms (Compounds)

[The Descriptors column, which gave a structure
drawing for each compound, is not reproduced.]

Address   Accession Nos. of Documents
100       1000-0, 1003-1
101       1000-0, 1000-1
102       1000-1, 1001-0
103       1001-0, 1003-2
104       1000-2, 1001-1, 1005-1
105       1002-1, 1004-0, 1005-1
106       1002-0, 1003-0, 1005-1
107       1003-0
108       1001-1, 1004-0
109       1001-0, 1002-0
110       1000-0, 1002-0, 1004-0
111       1000-1, 1000-2, 1003-0
112       1000-2, 1001-1


[The Processes in Documents column, which drew the
reaction for each process, is not reproduced.]

Accession Nos.   Compound Codes
1000-0           (100 + 110 + 101)
1000-1           (101 + 102 + 111)
1000-2           (111 + 104 + 112)
1001-0           (103 + 102 + 109)
1001-1           (104 + 108 + 112)
1002-0           (106 + 109 + 110)
1002-1           (105)
1003-0           (107 + 111 + 106)
1003-1           (100)
1003-2           (103)
1004-0           (110 + 105 + 108)
1005-1           (104 + 105 + 106)


APPENDIX B


[Flow diagram of the steps performed by the computer
for the search question. The middle of the diagram,
showing the fragment and compound files, is only
partly legible and is not reproduced. The first level
files and the final correlations are as follows.]


First level files:

   10    13    12    13
   11    14    15    14
   12    16    16    15
   17    17    17    16


Files of five digit numbers:

   1000-2    1002-1    1000-2    1001-1
   1001-1    1004-0    1001-1    1004-0
   1005-1    1005-1


MERGE             1000-2, 1001-1, 1002-1, 1004-0, 1005-1

MATCH (5 Dig.)    1001-0

MATCH (5 Dig.)    1001-1, 1004-0

MATCH (4 Dig.)    1001    ANSWER